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ABSTRACT 

The knowledge that potential guanine quadruplex 
sequences (PQs) are non-randomly distributed in 
relation to genomic features is now well established. 
However, this is for a general potential quadruplex 
motif which is characterized by short runs of 
guanine separated by loop regions, regardless of 
the nature of the loop sequence. There have been 
no studies to date which map the distribution of PQs 
in terms of primary sequence or which categorize 
PQs. To this end, we have generated clusters of 
PQ sequence groups of various sizes and various 
degrees of similarity for the non-template strand of 
introns in the human genome. We started with 
86697 sequences, and successively merged them 
into groups based on sequence similarity, carrying 
out 66 clustering cycles before convergence. We 
have demonstrated here that by using complete 
linkage hierarchical agglomerative clustering such 
PQ sequence categorization can be achieved. Our 
results give an insight into sequence diversity and 
categories of PQ sequences which occur in human 
intronic regions. We also highlight a number of 
clusters for which interesting relationships among 
their members were immediately evident and other 
clusters whose members seem unrelated, illust- 
rating, we believe, a distinct role for different 
sequence types. 

INTRODUCTION 

The occurrence of potential guanine quadruplex sequence 
motifs (PQs) within non-telomeric nucleic acids has been 
the subject of a number of studies (1-14) (for reviews, see 
refs 15 and 16) and several databases and web resources 
are available (17-21). Most of the emphasis of these 
surveys has been to examine the number of PQs and the 
genomic regions in which they occur. Several studies of 



individual and specific sequences at a small number of loci 
have been carried out. In particular, PQs associated with 
the promoter regions of the c-kit (22-25) and c-myc 
(26,27) genes have been examined in detail, as well as 
the 5'-untranslated region (UTR) region in several other 
genes including N-ras (28) and zic-1 (29). Apart from our 
initial analysis describing loop sequences within PQ 
sequences (1), there has been no systematic classification of 
PQs in terms of their primary sequence. Crystallographic, 
nuclear magnetic resonance (NMR) and modelling studies 
have demonstrated that the topology of guanine quad- 
ruplexes is very dependent on their primary sequence, as 
found, for example, in various human telomeric sequences 
(30-33), and the two c-kit sequences (22-25). Biophysical 
studies of loop size (34-36) and analyses of the effects of 
sequence in single-base loops (37) also confirm this 
conclusion. 

From the outset of sequence-based studies into poten- 
tial quadruplex sequences in non-telomeric nucleic acids, it 
has been clear that there are more sequences than can be 
experimentally studied, and to date only a very small 
fraction of the individual sequences have been examined, 
although there have been attempts to establish some more 
general rules governing the energetics of quadruplexes 
(38). Our initial survey of PQs in the human genome 
showed that there are 226 157 unique sequences that 
concur with our search criterion (1). In the same study, 
we carried out a detailed examination of loop sequences 
and established that in terms of sequence space, the dis- 
tribution of loop sequences is far from random, with some 
being very common and many others not appearing at all. 
However, examining loop sequences in this way is prob- 
lematic since, in instances with variable numbers of 
guanines in the G-tracts and/or isolated guanines in loop 
sequences, it is not currently possible to determine which 
guanines are part of the loop and which are part of the 
G-quartet core, in the absence of relevant experimental 
data. When more than four G-tracts are present in a 
sequence, we have the additional problem of determining 
which of them would participate in a more stable 
quadruplex structure. We thus need a more practical 
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and robust way of studying quadruplex sequences in detail 
than trying to derive information from loop sequences 
alone. In this study, we consider the sequences of poten- 
tial quadruplex-forming regions as a whole rather than 
their component parts (G-tracts and loop regions) and 
describe a method for finding groups of similar 
sequence. This removes any need to make prior assump- 
tions about topology. Finding many examples of a 
complex sequence is compelling evidence of positive selec- 
tion. The possibility therefore exists that quadruplex struc- 
ture is the reason for such selections. Of the clusters which 
contain sequence that are proven to form G-quadruplex 
structures, there is also the possibility that similar se- 
quences may also form similar folding topologies. We 
have used the non-template strand of introns in the 
human genome to develop our method and at the same 
time to produce new data on quadruplex sequences within 
introns. Our goals are therefore to develop a method to 
find groups of similar short sequence, apply it to potential 
guanine quadruplex sequences and, subsequently, deter- 
mine whether one can find correlations within these 
groups or something to link the genes in which the se- 
quences occur. In addition, the application of this 
method can be seen as a hypothesis-generating exercise, 
as the potential guanine quadruplex clusters can be used 
as a starting point for further analyses. We have therefore 
chosen a number of clusters to illustrate that different 
types of correlation can be found in the clusters. 

Hierarchical agglomerative clustering is a method with 
which one starts with the individual data and merges the 
most similar (39). The resulting clusters are then succes- 
sively merged until only one cluster remains. One ends up 
with a grouping of data in a dendrogram where, at suc- 
cessive levels, the cluster members are less similar. This is 
schematically represented in Figure 1. In order to cluster 
nucleic acid sequences in such a way, a similarity metric is 
needed, and for this, pair-wise sequence alignments were 
carried out and a similarity score was obtained. Once a 
similarity metric has been established, there are a number 
of ways to compare clusters. In this instance, the complete 
linkage method has been used, where the distance between 
two clusters is the score of the least-similar, longest 
distance between any member of one cluster to any 
member of the other. 



METHODS 

All genomic data were taken from the ENSEMBL 
database (40) homo_sapiens_core_57_37b and the non- 
template strand sequences were extracted from the 
intronic regions for all genes with status 'known'. The 
same method was used in earlier studies (1,5) to gather 
the G-rich sequences with the pattern: Gj,s^i-iGt,s 
L 1 _ 7 G3_5L 1 _ 7 G3_5, where G represents guanine bases 
and L represents any base including guanine. The 
ENSEMBL perl API was used to extract the genomic 
regions of interest and in-house software written in C++ 
used to extract the PQ regions. Regions that had more 
than four G-tracts were treated as a single sequence. A 
total of 101926 potential quadruplex-forming regions 
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Figure 1. The clustering process begins at the top, where the individual 
data are treated as clusters. The most similar data are merged, the 
similarity between the new clusters is derived and the most similar of 
those are merged until all of the data are in the same cluster. 



were extracted; however, a number of identical sequences 
were identified in this set, giving 86 697 unique sequences. 
The sequences were then collated into a mySQL database, 
along with information about their genomic locations. 

Sequence alignments and clustering were carried out 
using in-house software written in C++. The Smith- 
Waterman method was used as described by Durbin 
et al. (41) to carry out the individual alignments. The 
scoring scheme is quite simple since all mismatches are 
considered equal. Match = 1, mismatch = 0, gap = —0.5 
edge-gap = 0. Edge-gaps are the over-hanging part at the 
end of the sequences which arise from the fact that the 
sequences are often of different lengths, so edge-gaps are 
inevitable and therefore less costly than gaps within the 
sequence. 

The scoring for the clustering was carried out in a dif- 
ferent way from that of the pair-wise sequence alignments, 
since there is a different purpose for each of these. It was 
done by counting the number of gaps and mismatches 
dividing by the number of potential matches. The 
maximum number of matches in any alignment is the 
length of the shortest sequence. Since gaps in the longer 
sequence lead to fewer matches, we only penalize gaps on 
the shorter sequence. 

The scoring scheme used is described in the following. 

Details of this scheme and our rationale behind it can be 
found in the Supplementary Data. To avoid confusion 
these are not being called 'alignment scores' because 
they were not obtained when the alignments were 
carried out, but rather they are 'similarity scores'. This 
now provides a metric of sequence similarity which is inde- 
pendent of the lengths of the sequences involved. 
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We also obtained more clear-cut results for the cluster- 
ing by separating the two scoring schemes. The pair-wise 
sequence alignments were done to find the 'best' alignment 
between the sequences, and the scores for the clustering 
were calculated so that one pair-wise sequence alignment 
can be compared with another. It was necessary to score 
them independently of size since the pair-wise align- 
ments can be of varying size. (Supplementary Figure SI 
and Table S4 illustrate how the clustering was biased 
towards longer sequences being clustered first when 
using the alignment scores for the clustering from a 
subset of 1000 sequences chosen at random.) 

The scoring scheme that we developed was effectively 
our definition of sequence similarity and had to compare 
alignments of various lengths. There are many ways in 
which we could score our alignments, depending on how 
we define sequence similarity, which would possibly 
produce differing results, e.g. scoring mismatches higher 
than gaps might be sensible if we decided that guanine 
quadruplex loop length was more important to stability 
than base composition. However, we wish to assume as 
little as possible and so have kept the scoring scheme as 
simple as we can. We believe that the results from the 
method that we settled on indicates that it fits the 
purpose well. 

To compare whole clusters, full-linkage hierarchical ag- 
glomerative clustering was used. When attempting to use 
single-linkage and mean linkage hierarchical agglomera- 
tive clustering, it was found that the clusters were subject 
to an unacceptable amount of chaining, where unrelated 
sequences can end up belonging to the same cluster. This 
method was very computer-intensive since to measure 
the similarity score between two clusters, every pair of 
sequences between the two clusters must be aligned. The 
similarity matrix was too large to be held in computer 
memory (~28Gb of data). It was found that it was 
faster to pre-compute all of the sequence alignments 
(3 758 141 556 alignments), calculate the comparison 
scores and store them on a hard drive, since looking up 
the scores from the hard drive was faster than carrying out 
the alignment and obtaining the similarity scores on the 
fly. 

The clustering process went as follows: 

(i) Set similarity threshold to 1 and consider each 
sequence as a cluster. 

(ii) Compare all clusters and when a pair is found 
which has a score equal to or better than the simi- 
larity threshold, merge them together. 

(iii) Repeat stage 2 until there are no longer any pairs of 
clusters at or above the similarity threshold. 

(iv) Decrease the similarity threshold by 0.05 and go 
back to stage 2. 

The process at stage 2 is traditionally performed by 
merging the best pair of clusters and re-calculating the 
similarity scores between the newly formed cluster and 
the remaining clusters. However, this would have taken 
an impossibly long time with such a large number of 
sequences, since cluster comparisons are very costly in 
terms of computer time. To expedite the process, a 



coarse-grained approach was adopted which merged 
many of the clusters in a single cycle and greatly reduced 
the number of cluster comparisons that were carried out. 
By decreasing the similarity threshold by 0.05 increments 
every time, the process was greatly speeded up; we suggest 
that the outcome was not significantly different from what 
would have happened if it were practical to cluster by 
re-calculating the score matrix after every merging. 
(A comparison of the performance of both methods can 
be found in Supplementary Figure S2.) The clusters 
formed at the last cycle before the similarity threshold 
was dropped, thereby being of most interest. The degree 
of similarity of the cluster members can be derived from 
the similarity threshold and hence is related to the cycle 
number. The earlier in the clustering, the more similar are 
the cluster members. 

Several prominent clusters were chosen and dendro- 
grams drawn with software that was developed in-house, 
using the Python Imaging Library and the aggdraw 
module, in the Python programming language. We also 
carried out multiple sequence alignments between the 
cluster members for the purposes of illustration using 
ClustalW (42). 

We used FuncAssociate 2.0 (43) which employs the 
Fisher's exact test to determine the probability that gene 
ontology (44) (GO) terms are over-represented [the null 
hypothesis is that it is unsurprising that the number of any 
particular GO term appears in the test set (Cluster) by 
chance]. Since it is not impossible to find false positives 
when looking for correlations in large sets of data, 
FuncAssociate calculates an adjusted P-value that 
includes an estimation of the probability of obtaining at 
least one false positive. The list of genes belonging to each 
of the clusters produced by Cycles 5, 9, 13, 17, 20, 23, 26, 
29, 32, 36, 39, 42, 45 and 48 whose sequences were 
associated with more than 10 different genes were sent 
to the FuncAssociate server. A P-value cut-off of 0.05 
was used to determine which clusters were over- 
represented in any GO term. 

RESULTS 

Cluster size distribution 

Figure 2 shows how the number of clusters decreases at 
each cycle. The threshold level is also shown and it can be 
seen at which cycles the similarity threshold was decreased 
and how this affects the number of clusters. For example, 
between Cycles 7 and 8 the largest drop in the number of 
clusters occurs, from 72106 to 56049 clusters. The next 
significant drop, between Cycles 11 and 12 (55 802-39 852 
clusters), is almost as large. These are, therefore, the stages 
with the largest number of clusters merging and coincide 
with the similarity threshold decreasing from 0.95 to 0.9 
and from 0.9 to 0.85. 

Figure 3 shows the cluster size distribution changing 
with each cluster cycle, for clusters containing between 1 
and 400 sequences and on the final column for clusters 
larger than 400. The clusters were arranged in bins de- 
pending on the number of sequence members which they 
contained, starting with clusters with 0-10 members, then 
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Figure 2. The relationship between the similarity threshold and the number of clusters during the progression of the clustering process. 




10-20 members and so on until the clusters along the 
right-hand side with 400-86 696 sequence members. The 
heights represent the total number of sequences within the 
clusters in a particular bin and each coloured row repre- 
sents a cluster cycle. It can be seen that, as expected, the 
sequences are initially distributed among the small clusters 



(0-10) and it is not until Cycle 5 that there are clusters 
with greater than 10 sequences in them. By Cycle 27 the 
number of clusters containing 0-10 sequences is dropping 
significantly and the sequences are distributed among 
larger clusters and by Cycle 31 there are clusters which 
contain more than 400 sequences. As the process of 
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merging clusters continues, the distribution moves to the 
right until by Cycle 56 there are no longer any clusters 
below 400 sequences and finally at Cycle 65 the clustering 
has converged into a single cluster. 

There are a very large number of clusters and to carry 
out a detailed manual analysis of all of them would be 
unfeasible. We have therefore taken several clusters and 
highlighted some interesting features within them. 
Diagrams of these clusters show the multiple sequence 
alignments calculated using the program ClustalW next 
to the dendrograms generated from the clustering data. 
Many of the groupings in the dendrogram on the right 
can be correlated to features of the ClustalW-aligned se- 
quences even though they were derived through different 
means. For example, in Figure 5a sequences 8-10, which 
share very similar sequences over the first 17 bases, are 
grouped together much earlier in the clustering process 
than they are with the rest of the sequences in the 
cluster which differ in this region. 

Figure 4 and Table 1 show Cycle 27 cluster number 
4470, which contains a cluster of human telomere and 
human telomere-like sequences with the potential to 
form quadruplex structures. Azzalin et al. (45) and 
Schoeftner and Blasco (46) showed that telomeres are 
not transcriptionally silent and that the C-rich strand 
is transcribed more than the G-rich strand, resulting in 
r(UUAGGG),, being more abundant than r(CCCUAA)„. 
These G-rich RNAs can interact with telomeric DNA 
and also with the telomerase RNA template and thus 
inhibit the catalytic action of the telomerase enzyme 
complex. They can also interact with other gene 
products such as that of SMG which are also involved 
in the maintenance of telomeres. The clusters that we 
have here are examples of an area where this new class 
of RNA could also be transcribed. Although these se- 
quences are within introns, it is not inconceivable that 
they can exist alone or as part of smaller molecules after 
splicing and digestion. For example, it was been observed 
(47) that while in the quadruplex form, G-rich telomeric 
RNA is immune to digestion by Tl nuclease, which 
normally cleaves RNA after a single-stranded guanine 
residue. Further detail on these clusters is given in 
Supplementary Tables IS and 2S. Locating telomeric 



repeats in non-telomeric DNA has been previously 
observed, albeit not at the sequence level — for example 
Meyne et al. (48) discussed their distribution in 100 verte- 
brate species. 

Figure 5a and b and Tables 2 and 3 show clusters which 
are mainly composed of closely related zinc-finger genes. 
The members of the cluster in Figure 5a all belong to the 
same interpro (49) families, IPR001909 Krueppel- 
associated box and IPR007087 Znf_C2H2. They occur 
at 13 different locations, with 10 unique sequences. The 
location of the sequences within the genes is similar for 
most of these genes, usually about 200-300 bases from the 
beginning of the first intron. The variable parts of the 
sequence tend to be outer 'loops' while the central GGG 
AGGG core appears to be conserved. This is also 
conserved in another very similar cluster shown in 
Figure 5b. The majority of those genes belong to the 
same interpro families, IPR001909 Krueppel-associated 
box and IPR007087 Znf_C2H2. Sequence 7 is shared by 
two genes which overlap, AC0 10300.1 and ZNF91. 
ZNF91 being contained entirely within an intron of 
AC010300.1. Almost all of these genes are found in the 
same area of chromosome 19; however, two genes are 
found in entirely different locations, ZNF107 is found 
on chromosome 7 and MAP IB is found on chromosome 
5. Although MAP1B is an unrelated gene, its expression 
has been shown to be controlled by the zinc finger gene 
BCL11A which also belongs to interpro family IPR007087 
Znf_C2H2 (50). When looking at the variable and 
conserved regions, we need to be aware that the search 
criterion may have an effect on what we see, i.e. be cog- 
nizant of the fact that we will always have conserved runs 
of guanines in the sequences. It may be more instructive to 
look at the conservation of the loop sequences; however, if 
the guanine runs are longer than three bases, there is scope 
for variability around the edges, under our search criter- 
ion. The cluster in Figure 5a appears to be more variable 
in the region of the first loop while the central 'A' loop is 
the same throughout and the third loop 'TCAT' has only 
one difference, a substitution of an adenine for a cytosine. 
Since the final G-runs are longer than three bases, we see 
two cases where the guanines are substituted for an 
adenine and for a thymine. 



1 GGGGCTAGGGTTAGGG- 

2 GGG- 

3 GGGGCTAGGGTTAGGGTTGGGTTAGGGTTA- -GGGTTAGGG- 

4 GGGTTAGGG- 

5 GGGGCTAGGGTTAGGGTTGGGTTAGGGTTA- -GGGTT-GGG 

6 GGGTTAGGGTTA- -GGGTT-GGG 

7 GGGTTAGGG 

8 GGGGTTAGGGTT- GGGG 

9 GGGTTAGGG TT- GGGG 

10 GGGTTAGGGTTTGGG 

11 GGGTTAGGGTTGGGG 

12 GGGTTTAGGGTTTGGGTTTGGGTTAGGGCTTAGGGG 

13 GGGTTTGGGTTGGGTTTAGGGTTAGGGTTTAGGGTTAGGG- 



14 - 



-GGG 



15 GGGTTCGGG 

16 GGGTTGGGTTAGGGTTGGGTTAGGGTAGGGTTAGGGTTAGGGGTTAGGGG 

17 GGGTTGGGTTAGGGTTGGGTTAGGGTAGGGTTAGGGTTAGGGGTTAGGGG 

18 GGGTGGGAG 

19 GGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGTTAGGGTTAGGGGTTAGGGG 

20 GGGTTAGGGG 

21 GGGTTAGGG 

22 GGGTTAGGGGTTAGGGGTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGG 

23 GGGTTAGGGTTA- -GGGTTAGGG- 

24 GGGTTAGGGTTA- -GGGTTAGGG 

25 GGGGTTAGGGTTAGGGG- 

26 GGGTTAGGGC 

27 GGGCTTAGGGC 

28 GGGCTTAGGGC 

29 GGGATTAGG6T 
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Figure 4. Telomeric like quadruplex sequences. 
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Table 1. Telomeric sequence clusters 
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Within Cycle 18, several clusters were found to be 
over-represented in the GO term GO:0003823 'antigen 
binding'. Cycle 18 cluster 13 461 (Figure 6 and Table 4) 
is one such cluster, which consists mainly of L1R genes 
(leucocyte immunoglobulin-like receptor). These genes are 
all found in the same genomic location: region 19ql3.4. 
All but one of the genes in the cluster occur at this 
location; however, since some of the genes are over- 
lapping, the total number of locations is 11. Cycle 18 
cluster 448 (Figure 7 and Table 5) contains a number of se- 
quences which occur within two immunoglobulin genes, 
IGHA2 and 1GHM which contain a number of very 
similar sequences. A third IGH gene IGHV3-6 is a pseudo- 
gene; however, since certain pseudogenes may play an im- 
portant role in regulation (51,52), this may still be a 
biologically relevant locus. Three other genes which 
appear in this cluster, TR1M29, ZNF831 and BRSK2, 
are unrelated to the immunoglobulins. A similar cluster, 
Cycle 18 cluster 1086, (Figure 8 and Table 6), contains the 
same immunoglobulin genes and similar sequence motifs. 
This also contains three non-immunoglobulin genes 
KCNK2, SMAD and the same kinase gene as found in 
the aforementioned cluster, BRSK2. Closer examination 
of the regions in which these sequences occur in both the 
immunoglobulin genes and the BRSK2 suggests that they 



(a) 

1 - GGGGCCACGGGAGGGTCATGGGGG- 

2 GGGGGCCACGGGAGGGTCATGGGGGG 

3 - GGGGACACGGGAGGGTCATGGGGGG 

4 - GGGGTCACGGGAGGGTCATGGGGGG 

5 - -GGGGCACGGGAGGGTCATGGAGGG 

6 GGGGGCCACGGGAGGGTCATGTGGG- 

7 GGGGGCCGCGGGAGGGTCATGGGGG- 

8 - GGGACCACGGGAGGGTCCTGGGG- - 

9 GGGGACCACGGGAGGGTCATGGGGG- 
10 GGGGACCATGGGAGGGTCATGGGG- - 




1.00 0.95 0.90 U0.85 
Similarity Threshold 



0.80 



(b) 



1 - GGGACCACGGGAGGGTCCTCAGGGG 

2 - GGGACCACGG6AGGGTCGTCAGGGG 

3 - GGGACCACGGGAGGGTCTTCAGGGG 

4 GGGGACCACGGGAGGGTCATCAGGGG 

5 - GGGACCACGGGAGGGTGGTCAGGGG 

6 - GGGACCACGGGAGGGTTGTCAGGGG 

7 - GGGATCACGGGAGGGTCGTCAGGGG 

8 - GGGACCCGGGGAGGGTCGTCAGGGG 

9 - GGGACCGTGGGAGGGTCGTCAGGGG 

10 - GGGACCAAGGGAGGGTAGTCAGGGG 

11 - GGGAT- CGGGGAGGGTC - TCAGGGGCAAGTGGG 

1 2 - GGGATGC C GGGAGGGTC ATC AGGGG 




1.00 



0.95 10.90 B0.85 B0.80 
Similarity Threshold 



10.75 



Figure 5. (a) Cycle 18 cluster 202 zinc finger type genes 1. (b) Cycle 21 
cluster number 4672 zinc finger genes. 



Table 2. Cycle 18 cluster 202 zinc finger type genes 1 



Leaf no. 


Gene 


EnsembllD 


From 
start 


To end 


Feature 




1 


ZNF844 


ENSG00000223547 


300 


8830 


Intron 


I- 


-2 


2 


ZNF491 


ENSGOOOOO 177599 


289 


5499 


Intron 


1- 


-2 


3 


ZNF833 


ENSG00000197332 


286 


4031 


Intron 


1- 


-2 


4 


ZNF709 


ENSG00000242852 


247 


46 621 


Intron 


1- 


-2 




ZNF564 


ENSG00000196826 


37 833 


46 621 


Intron 


1- 


-2 


5 


ZNF709 


ENSG00000242852 


29 380 


17489 


Intron 


1- 


-2 




ZNF564 


ENSGOOOOO 196826 


66 966 


17489 


Intron 


I- 


-2 


6 


ZNF69 


ENSG00000198429 


279 


15 324 


Intron 


1- 


-2 


7 


ZNF627 


ENSG00000198551 


228 


16 643 


Intron 


I- 


-2 


8 


ZNF791 


ENSGOOOOO 173875 


211 


12383 


Intron 


I- 


-2 


9 


ZNF20 


ENSG00000132010 


290 


3960 


Intron 


1- 


-2 




ZNF625 


ENSG00000213297 


290 


3960 


Intron 5- 


-6 


10 


ZNF44 


ENSG00000197857 


6496 


15 598 


Intron 4- 


-5 



Table 3. Cycle 21 cluster number 4672 zinc finger genes 



Leaf no. 


Gene 


EnsembllD 


From 
start 


To end 


Feature 




1 


AC01 1477.1 


ENSG00000245381 


31 177 


25 473 


Intron 2- 


-3 


2 


ZNF100 


ENSGOOOOO 197020 


279 


1336 


Intron 1- 


-2 




ZNF681 


ENSGOOOOO 196 172 


264 


2906 


Intron 1- 


-2 


3 


ZNF431 


ENSG00000196705 


288 


1051 


Intron 1- 


-2 


4 


ZNF493 


ENSG00000196268 


286 


7549 


Intron 1- 


-2 


5 


ZNF492 


ENSG00000229676 


290 


18517 


Intron 1- 


-2 


6 


ZNF85 


ENSG00000105750 


282 


10313 


Intron 1- 


-2 


7 


ACO 10300.1 


ENSG00000235694 


71073 


70 836 


Intron 9- 


-10 




ZNF91 


ENSG00000167232 


287 


20248 


Intron 1- 


-2 


S 


ZNF254 


ENSG000002 13096 


272 


18 305 


Intron 1- 


-2 


9 


ZNF738 


ENSG000001 72687 


289 


2308 


Intron 1- 


-2 


10 


ZNF724P 


ENSG00000196081 


287 


17 634 


Intron 1- 


-2 


11 


MAP IB 


ENSG00000131711 


41 788 


26124 


Intron 2- 


-3 


12 


ZNF588 


ENSG00000196247 


322 


12 544 


Intron 1- 


-2 
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are all part of a larger region of similarity. That we have 
closely related genes with similar sequences within their 
introns is perhaps no great surprise; however, the existence 
of similar sequences within the introns of unrelated genes 
is an unexpected observation. 

Cycle 18 cluster 2 (Figure 9 and Table 7) contains 27 
sequences; however, many occur in more than one locus 
and the sequences in the cluster appear 140 times. 
Sequences 1 and 5 are the most common, occurring 45 
and 51 times, respectively. We used biomart (53) in the 
ENSEMBL website to retrieve the interpro (49) IDs for 
the genes involved (full details are given in the 
Supplementary Data). Of the 88 genes which had 
interpro mappings, several were related; however, none 
occurred more than seven times and the genes are 
distributed over a range of gene families. Among them 



1 GG6AT6AGGGGTGGGGG- TCCCAAGGGAGGG 

2 GGGGTGGGGG- TCCCAAGGGAGGG 

3 GGGATGAGGGATGGGGG- TCCCAAGGGAGGG 

4 GGGTTGGGGGGTC C C AAGGGAGGG 

5 GGGTTGGGGG- TCCCAAGGGAGGG 

6 GGGATGAGGGGTGGGGG- TCCTAAGGGAGGG 

7 GGGATGAGGGGTGGGGG- TCTCAAGGGAGGG 

1.00 0.95 B0.90 10.85 B0.80 
Similarity Threshold 

Figure 6. Cycle 18 cluster 1346. Cluster containing sequences that 
occur chiefly within leukocyte immunoglobulin-like receptor (LIR) 
genes. 



Table 4. Cycle 18 cluster 13 461. Cluster containing sequence 
which occur chiefly within leucocyte immunoglobulin-like receptor 
(LIR) genes 




Leaf no. 


Gene 


EnsembllD 


From 
start 


To end 


Feature 




1 


LILRA6 


ENSG00000244482 


77 


147 


Intron 


5- 


-6 




LILRB3 


ENSG00000204577 


77 


147 


Intron 


5- 


-6 




LILRB3 


ENSG00000204577 


18 899 


1733 


Intron 


7- 


-8 


2 


LILRA1 


ENSG00000104974 


617 


3193 


Intron 


5- 


-6 




LILRB1 


ENSG00000104972 


21906 


35 443 


Intron 


2- 


-3 




LILRP2 


ENSG00000240258 


84 


146 


Intron 


3^1 




AC006293.1 


ENSG00000170858 


84 


146 


Intron 


4- 


-5 


3 


LILRB1 


ENSG00000104972 


77 


148 


Intron 


5- 


-6 


4 


KCNH2 


ENSG0000O055118 


446 


30 


Intron 


8- 


-9 


5 


LILRA2 


ENSG00000239998 


84 


147 


Intron 


6- 


-7 




LILRB1 


ENSG00000 104972 


1525 


55 824 


Intron 


2- 


-3 


6 


LILRA4 


ENSG00000239961 


79 


147 


Intron 


5- 


-6 




LILRA3 


ENSG00000170866 


77 


146 


Intron 


9- 


-10 


7 


AC011515.1 


ENSG00000225370 


77 


153 


Intron 


2- 


-3 



are kinases, zinc-finger genes, RAB/RAS genes, WD-40 
domains and catenins. Many are known to be associated 
with signal-transduction pathways (RIN3, TBC1D19, 
RASGRF3, CDK14, ARHGAP6, CTNNA3 and 
CTNND2, to name a few) and many are involved in mi- 
tosis (CENPQ, PARD3B, ALMS1, SPTLC1, etc.). This 
cluster demonstrates a significant number of genes both 
related and unrelated, which contain similar and often 
identical PQ sequences. 

Using the FuncAssociate tool to characterize gene sets, 
we discovered that a number of clusters were 
over-represented in certain GO terms. The results are 
summarized in Table 8, which shows, for each cycle, the 
number of clusters whose sequences fell in more than 10 
ENSEMBL genes, the number of these which were over- 
represented in GO terms, the sum of the number of GO 
terms which were found to be over-represented in each 
cluster and the percentage of chosen clusters in which 
were found to be over-represented in GO terms. The per- 
centage of clusters examined which contained over- 



Table 5. Cycle 18 cluster 448. Cluster containing sequences which 
occur in immunoglobulin genes IGHA2 and IGHM 



Leaf no. 


Gene 


EnsembllD 


From 
start 


To end 


Feature 


1 


IGHA2 


ENSG000002 11890 


1459 


1736 


Intron 


1-2 


2 


TRIM29 


ENSG00000137699 


8326 


383 


Intron 


1-2 


3 


IGHA2 


ENSG00000211890 


979 


2231 


Intron 


1-2 


4 


IGHA2 


ENSG00000211890 


472 


2763 


Intron 


1-2 


5 


IGHA2 


ENSG000002 11890 


844 


2321 


Intron 


1-2 


6 


IGHA2 


ENSG000002 11890 


549 


2701 


Intron 


1-2 


7 


IGHA2 


ENSG00000211890 


1084 


2106 


Intron 


1-2 


8 


IGHA2 


ENSG000002 11890 


1749 


1486 


Intron 


1-2 


9 


IGHV3-6 


ENSG00000233855 


5773 


86 881 


Intron 


10-11 




IGHM 


ENSG00000211899 


2872 


2301 


Intron 


1-2 


10 


IGHV3-6 


ENSG00000233855 


3982 


88 648 


Intron 


10-11 




IGHM 


ENSG000002 11899 


1081 


4068 


Intron 


1-2 


1 1 


BRSK2 


ENSG000O0 174672 


517 


3328 


Intron 


12-13 


12 


IGHV3-6 


ENSG00000233855 


5943 


86716 


Intron 


10-11 




IGHV3-6 


ENSG00000233855 


5983 


86 676 


Intron 


10-11 




IGHM 


ENSG00000211899 


3042 


2136 


Intron 


1-2 




IGHM 


ENSG000002 11899 


3082 


2096 


Intron 


1-2 


13 


ZNF831 


ENSG00000 124203 


15164 


30 732 


Intron 


3^1 


14 


IGHV3-6 


ENSG00000233855 


4833 


87 826 


Intron 


10-11 




IGHV3-6 


ENSG00000233855 


4884 


87775 


Intron 


10-11 




IGHV3-6 


ENSG00000233855 


5274 


87385 


Intron 


10-11 




IGHM 


ENSG00000211899 


1932 


3246 


Intron 


1-2 




IGHM 


ENSG00000211899 


1983 


3195 


Intron 


1-2 




IGHM 


ENSG00000211899 


2373 


2805 


Intron 


1-2 



1 GGGC TGGACTGGGCTGGGCTGGGC TGGGC TGGGC TGAGCTGGGCTGGGCTAGATTGGGCTGGGCTGGACTGGGC TGGGC TGGG 

2 GGGCTGGGCTGGGC TGAGC TGGG 

3 GGGCTGGGCTGGGCTAGGCTGGGCTGGGCTGGGCTGGGCTGGGCTAGATTGGGCTGGGCTGGACTGGG 

4 GGGCTGAGC TGGGCTGGGCTAGGCTGGGC TGAGC TGGGC TGGG 

5 GGGCTGGGCTGGGC TGAGC TGGGC TGGGC TGGGC TGGGC TGGGCTGGGCTAGGCTGGGC TGGGC TGGGC TGGGC TGGGCTGGGCTAGGCTGGGC TGGGC TGGGC TGAGC TGGG- 

6 - GGGCTGGGCTGGGC TGGGC TGAGC TGGG - 

7 GGGC TGGGC TGGGC TGGGCTGGGC TGGGC TGGGC TGGGCTAGGC TGGGC TGGGC TGAGC TGGGC TGGGC TGGGC TGGGCTGAGCTGGG 

8 GGGC TAGGC TGGGC TGGGC TGAGC TGGGC TGGGC TGAGC TGGG 

9 - GGGC TGGGC TGAGC TGGGC TGGGC TGGG 

10 - GGGCTGGGC TGAGC TGGGC TGGGCTGGGC TGGGC TGAGCGGGTC TGAGC GGG 

11 GGGCAGGGC TGAGC TGGGC TGGG 

12 GGGC TGGGC TGAAC TGGGC TGGG 

13 - GGGC TGGGC TGAGC TGGGC TGGGGG 

14 GGGC TGGGC TAAC CTGGGC TGGG 

1.00 0.95 : 0.90 B0.85 BO 80 
Similarity Threshold 

Figure 7. Cycle 18 cluster 448. Cluster containing sequences that occur chiefly in immunoglobulin genes IGHA2 and IGHM. 
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1 GGGC TGGGC TGAGCTGGGCTGAACTGGG- 

2 GGGGTGGGCTGGGC TGAGC TGGGC TGAGC TGGGC TGGG - 

3 - -GGGGTGGGCTGGGC TGAACTGGG 

4 GGGC TGGGCTGGGC TGAGC GGGGC TGAGC GGGG- 

5 GGGTGGGC GGGGC TGGGC TGAGC TGGGC TGAGC TGGG- 

6 GGGC TGAGC TGGGC TGAGC TGGGC TGAGC TGGG - 

7 GGGC TGGGC TGAGC TGGGC TGAGC TGGGC TGAGC TGGG - 

8 GGGC TGGGC TGAGCTGGGCTAGGCTGGGC TGAGC TGGG- 

9 GGGTTTGGGCTGGGGTGGGTTGGGC TGAGC TGGGC TGAGC TGGG- 



my 



100 095 0.90 10.85 B0.80 
Similarity Threshold 

Figure 8. Cycle 18 cluster 1086. Cluster containing sequences that 
occur chiefly in immunoglobulin genes IGHA2 and IGHM. 



Table 6. Cycle 18 cluster 1086. Cluster containing sequences which 
occur chiefly in immunoglobulin genes IGHA2 and IGHM 



Leaf no. 


Gene 


EnsembllD 


From 
start 


To end 


Feature 


1 


IGHV3-6 


ENSG00000233855 


6237 


86417 


Intron 


10-11 




IGHM 


ENSG000002 11899 


3336 


1837 


Intron 


1-2 


2 


IGHA2 


ENSG000002 11890 


2392 


848 


Intron 


1-2 


3 


KCNK2 


ENSG00000082482 


61 113 


19 276 


Intron 


1-2 


4 


IGHV3-6 


ENSG00000233855 


4247 


88 402 


Intron 


10-11 




IGHM 


ENSG000002 11899 


1346 


3822 


Intron 


1-2 


5 


BRSK2 


ENSG00000 174672 


363 


3468 


Intron 


12-13 


6 


IGHA2 


ENSG000002 11890 


1654 


1591 


Intron 


1-2 


7 


IGHV3-6 


ENSGO0O00233855 


4052 


88 592 


Intron 10-11 




IGHM 


ENSG000002 11899 


1151 


4012 


Intron 


1-2 


8 


IGHV3-6 


ENSG00000233855 


4197 


88 447 


Intron 


10-11 




IGHM 


ENSG000002 11899 


1296 


3867 


Intron 


1-2 


9 


SMAD1 


ENSG00000170365 


10591 


14155 


Intron 2-3 



1 GGGGTGGG- GGGAAGGGGGAGGG 

2 GGGGTGGG- GGGAAGGGGGAGGGG 

3 GGGTGGG- GGGAAGGGGGAGGG 

4 GGGGTGGG- -GGGAGGGGGAGGG 

5 GGGGTGGG - GGGAGGGGG - AGGGG 

6 GGGGTGGG- GGGAGGGGG- AGGGGTAGCATTGGG 

7 GGGC TGGG- GGGAGGGGGGAGGG 

8 GGGC TGGG- GGGAGGGGGGAGGGG 

9 GGGGTGGG- GGGAGGGGGGAGGGGTAGCATTGGG 

10 GGGGTGGGGTGGG- GGGAGGGGGGAGGGGTAGCATTGGG 

11 GGGTTGGG- GGGACGGGGGAGGG 

12 GGGGTGGG- CGGAAGGGGGAGGG 

13 GGGGTGGG- CGGAGGGGGGAGGG 

14 GGGGTGGGGCGGAGGGGGGAGGG 

15 GGGGTGGG- GGGAGGGGGGAGGGACAGCATGGG 

16 GGGGTGGG- GGTAAGGGGGAGGG 

17 GGGTGGG- GGTAAGGGGGAGGG 

18 GGGGTGGG- GGTAGGGGGGAGGGG 

19 GGGGTGGG- GGTAGGGGGGAGGGGTAGCATTGGG 

20 GGGGTGGG- GGAAGCGGGGAGGG 

21 GGGGTGGG- GGAAGTGGGGAGGG 

22 GGGTGTGGG- GGAAGGGGGC AGGGG 

23 GGGTGGG- GGCAGGGTGGAGGG 

24 GGGTGGG - GGCAGGGTGGAGGGG 

25 GGGTGGG- GGCAGGGGGGAGGGGTAGCATTGGG 

26 GGGATGGG- GGCAGGGGGGAGGG 

27 GGGGAGGG- GGCAGGGGGGAGGGG 

1.00 0.95 D0.90 B0.85 
Similarity Threshold 




0.80 



Figure 9. Cycle IS 
lated genes. 



cluster 2. This shows sequences that occur in unre- 



represented GO terms did not vary dramatically for 
clusters 5-39 where it began at 12% for Cycle 5 and 
remained for the most part between 9% and 10%. At 
cluster 49 it began to rise, 18% for Cycle 49, 24% for 
Cycle 45 and 48% for Cycle 48. This approximately 
steady rate at Cycles 5-39 is probably due to the fact 
that while new clusters are being formed and genes 



Table 7. Cycle 18 cluster 2. Sequences which occur in unrelated 
genes. List of genes in which each sequence in this cluster occurs 

1 CHST9 CTNND2 PDZD2 TBC1D19 PDE4D TRIM5 PLCB1 

LRRC9 TMEM170B AP000705.4 LRRK2 SUMF1 NELL1 
MEGF10 FBN2 AP003355.2 VPS13B AC090922.1 AC096733.1 
RNF150 WDFY4 ALK SLC16A7 GLIPR1L2 SEMA3D EIF4G3 
PFTK1 C2orf34 DLG2 F8 RP11-451L9.1 FRMD4B ZNF28 
ZNF665 PDE4B AC010132.1 CTB-111H14.1 AC009264.1 
COL24A1 RP11-457K10.1 DYTN KCNE4 AC007254.1 
RAB3GAP2 RP11-479J7.2 

2 RIN3 

3 PTPRD 

4 NBEA FBXL5 KLF12 ADAMTS3 SLC24A3 RASGRF2 BICD1 

BEND7 AP000235.2 LACTB2 FAM190A Cllorf74 TBCK 
AMBRA1 PSMC1 TEX11 PPP2R2B KIAA2022 SPTLC1 
MAGT1 CTNNA3 ODZ3 UQCRFS1 AC008413.1 MON2 
Cllorf80 CENPQ NRCAM TRIM77 AC003050.1 ATRNL1 
FXYD6 RP11-702L6.4 SLC9A10 STXBP5L RP1 1-310E22.1 
CASC2 AC005582.1 ST6GALNAC3 RP4-630C24.1 LRP1B 
GALNT13 SLC25A24 RP1 1-439L18.3 AC018359.2 PTH2R 
AC079613.1 AC093865.2 RP11-542C10.1 RP11-202K23.1 
RP11-479J7.2 

5 RP11-735B13.1 

6 PK4P 

7 DLG2 

8 PTPRD 

9 AF127577.3 AGBL1 AFF2 CYP4B1 AC003090.1 FAM19A3 

PARD3B ALMS1P 

10 HERC2 

11 SLC26A7 

12 NRSN1 

13 EFCAB5 TFAP2D 

14 NAV3 

15 XKR4 

16 BBOX1 ALOX5 AL592494.3 

17 C2orf34 

18 TRPC4 

19 THSD4 

20 ARHGAP6 ALMS1 RP11-615J4.4 AC009499.1 MRPL33 

21 ARL15 ACCN1 PDSS2 JAK1 PDE4B RP3-433F14.1 

22 RP4-781K5.2 

23 KIAA0146 

24 COL5A3 

25 PDE3B 

26 MBD5 

27 RP11-202P11.1 



which are associated to common GO terms come together, 
other clusters which are over-represented in GO terms are 
being 'diluted' and the significance of the over-represented 
GO terms is being reduced. 

The raw cluster data will be available upon request from 
alan.todd@pharmacy.ac.uk and in the future from a 
webpage. 



DISCUSSION 

Introns have often been assumed to be mutationally 
neutral. However, there is growing interest in blocks of 
intronic regions which are conserved across species and 
which have been suggested as candidate areas of trans- 
acting regulatory regions (54-56). Although we have 
only examined a single species in the present analysis, 
the same reasoning can be applied to paraloguous 
regions as well as orthologues. Indeed, genes which are 
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Table 8. Number of clusters whose associated GO terms were found 
to be over-represented using FuncAssociate 
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65 
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20 


94 
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23 
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1768 


9.55882352941 


26 


268 


772 


2921 


9.17494008901 


29 


348 


862 


3620 


9.61325966851 


32 


311 


670 


3167 


9.82001894537 


36 


196 


392 


1824 


10.7456140351 


39 


111 


240 


1060 
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42 


86 


230 


506 


16.9960474308 


45 


54 


207 


238 


22.6890756303 


48 


50 


331 


100 


50.0 



co-expressed and have a common regulatory mechanism 
do not necessarily have to be paralogues; the same ex- 
acting promoter binding motifs, for example, often exist 
upstream of unrelated genes. From a sequence conserva- 
tion point of view, it is perhaps more remarkable to find 
large numbers of similar sequences in unrelated genes as in 
closely related ones. One could argue that it is possible for 
similar sequences in closely related genes to be simply pas- 
senger sequences which have not yet had time to diverge. 
In less closely related genes, it could even be argued that 
mutational cold spots (57) are responsible for some of the 
conserved sequence. Since we have identified clusters in 
genes whose members have a range of genetic distances 
from the closely related zinc finger genes in Figure 5 to the 
unrelated genes in Figure 9, we feel confident in stating 
that selective pressure is likely to be responsible for many 
of the sequence clusters observed here. The range of types 
of cluster and sequence types suggests that they have many 
different biological roles. 

Eddy and Maizels (4) showed that there was a relation- 
ship between gene function and the number of PQ se- 
quences found within those genes. By finding clusters 
which are over-represented in particular GO terms, we 
have shown that this type of relationship also applies at 
the sequence level and we can use the clusters to examine it 
further. 

By comparing the sequences in a multiple sequence 
alignment, we may see which elements are conserved and 
which are variable. If the sequence group forms a quad- 
ruplex structure then some of these conserved and variable 
regions may not be critical in quadruplex formation but 
may be critical bases for molecular recognition. In certain 
cases, this would be more useful than simply finding 
quadruplex-dependent positions. 

Whether one takes the abundance of similar PQs as 
evidence of selective pressure or not, the clustering data 
may still be exploited. For example, one of the key areas 



of G-quadruplex research currently focuses on developing 
ligands which block transcription by stabilizing a particu- 
lar quadruplex sequence. It may be important to know 
how unique that sequence is in order to provide specificity. 

Meaningfulness of clusters 

Since the clusters were merged using the full-linkage 
method, then the similarity threshold will be the lowest 
score between any pair of sequences in a cluster. At 
Cycle 16 (where most of the examples are from), the simi- 
larity threshold was 0.8. For a comparison of sequences 
where the shortest sequence is around 24 bases long, simi- 
lar to the majority of cases in the cluster in Figure 4, the 
worst alignments would have to contain, for example three 
mismatches and two gaps which would give a similarity 
score of 0.808. In practice, the majority of alignments in 
that cluster are much more similar and this generally 
appears to be the case. 

We have derived clusters of varying similarity and size, 
which raises the question of what represents a biologically 
relevant cluster. As the clustering progresses, less similar 
sequences are added to each cluster and at some stage the 
members will be merged, which do not have a similar bio- 
logical role. The point at which this occurs is impossible to 
determine without knowledge of the role of these se- 
quences or without experimental evidence. In the cases 
where we have discretely grouped clusters, rather than 
continuous merging through the clustering process, this 
should be less of a problem. We suggest that sequence 
types whose significance is determined in the future may 
have differing roles and so will require different degrees of 
similarity. Indeed, the cluster examples which we have 
presented were chosen because they represent a variety 
of different types of correlation: clusters which had a cor- 
relation with gene ontologies, those which correlated with 
protein families, clusters which belonged to disparate 
protein families and an example of a cluster which was 
found because of a particular interest (TERRA). The 
TERRA cluster is also an example which contains se- 
quences that are known to form stable DNA and RNA 
quadruplexes. 

The clustering in this study was performed on introns of 
human genes. It is now possible to examine other regions 
of genomic DNA with this methodology and search for 
clusters in, for example UTR regions, promoter regions or 
exons. The sequences which were clustered here are those 
which we selected using our criteria of four runs of at least 
three guanines, separated by loop regions. However, 
guanine quadruplex structures may not necessarily be 
formed exclusively from this sequence type. Indeed, in 
light of a recent structural study by Kuryavyi and Patel 
(58), we feel that a clustering approach using a yet more 
general rule for which sequences can potentially form 
quadruplex structures, will in due course bear fruit. This 
structure is not the only one to report a G-quadruplex 
with a topology which involves more than a simple 
sequence containing G-tracts separated by loop sequences; 
see in particular the molecular structures of the sequence 
in the promoter region of the c-kit gene (22-25). 
Clustering methods can be applied to any group of 



4926 Nucleic Acids Research, 2011, Vol. 39, No. 12 



sequences including, for example those which follow a 
specific template and those which are generally G-rich. 
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